This study was designed to explore the practicality, flexibility, reliability, and
sensitivity of the Criterion Group Method for measuring the effectiveness and
efficiency of indexing, a method using a criterion group to set the standard for
"ideal" indexing. These major variables were examined: (1) size of document sample,
(2) size of Criterion Group, (3) instructions to indexers and use of a vocabulary
guide, (4) three methods of editing raw indicia to make terms comparable, and (5) two
methods of weighting indexers' scores. Scores earned by a set of eight professional
indexers, by individual authors of the test documents and, in some cases, scores for
title sets or medical students' indexing were compared within selected treatments to
measure the extent to which the detectability of differences was achieved by each
treatment. A two-way analysis of variance was used to relate reliability of test
scores to document sample size and criterion group size. From the results of these
studies of the methodologic variables, it was concluded that the criterion group
method of evaluating indexing can be a practical yardstick for a wide variety of
managerial, research, and educational uses. Appendixes include the rationale of the
method, a literature review, information on materials employed and subjects
participating in study trials, and details on manual and computer implementation.
(Author/JB)
SUMMARY

The criterion group method tests the effectiveness and efficiency of
test indexing sets, using a criterion group to set the standard for "ideal"
indexing. The criterion group for a particular application is chosen by
the test administrator, consistent with his own concept of who represents
this "ideal". Matching test sets of indexing terms with the criterion set
yields as many degrees of match as there are members of the criterion
group (referred to as consensus number).

This study was designed to explore the practicality, flexibility,
reliability, and sensitivity of the method. To do this, it examines
the major variables: (1) size of the document sample, (2) size of the
criterion group, (3) effect of various instructions to indexers and use
of a vocabulary guide, (4) effects of three methods of editing raw
indicia to make terms comparable, and (5) two alternative methods of
weighting indexers' scores.

Scores earned by a set of eight professional indexers, by individual
authors of the test documents, and in some cases scores for title
sets or medical students' indexing, were compared within selected
treatments to measure the extent to which the detectability of differences
was achieved by each treatment. A two-way analysis of variance was used
to relate reliability of test scores to document sample size and
criterion group size.

Results with regard to practicality show that "indicative" tests
(allowing confidence limits of ± 10 points) at the 80% level of
confidence can be made with document samples as small as 10 and criterion
groups as small as 4; 95% confidence requires comparable values of 20
documents and 9 criterion group members. It is possible to conduct
tests with only a few "mechanical" instructions to indexers, no
vocabulary guide, no editing and no weighting during scoring other than use
of the consensus number, even though, from the standpoint of sensitivity,
the method can detect differences in scores due to editing method,
instructions to indexers, use of a vocabulary guide, or weighting method,
should such detection be desirable. The method is flexible in that it
has been shown to leave as options variables such as method of
instructing indexers, method of editing and method of weighting. Reliability is
primarily dependent on the size of document sample and criterion group,
as is demonstrated graphically in the paper. Face validity, or intuitive
feel for the meaning of test results, is enhanced by the fact that
differences in scores can be equated with differences in the internationally
known measure of "recall", and score divided by the number of terms in
the indexing set yields a result somewhat analogous to "precision".
Percent maximal score can also be easier to envision as reflecting
effectiveness of test indexing sets on a 0-100 scale, with differences of from
6-8 points representing significance when signed-rank tests are applied.
INTRODUCTION

Aim of Study

Several years ago, for a special research application, we
developed a method for measuring the "quality" or "effectiveness"
of indexing.* At that time we felt this method could be
adapted for a wide range of managerial, educational, and research
applications and, since it had several important advantages over
other methods,# it might fill the critical need for a practical
yardstick to evaluate indexing and subject cataloging.
However, there were a number of questions to be answered before
one could be certain that the method met the demanding requirements
for such a yardstick. The present study was undertaken to
explore these questions.

Methodologic Desiderata

For a truly general method, applicable to many types of indexing
and subject cataloging and suitable for serving a wide range of
purposes, certain methodologic characteristics would seem to be
either essential or highly desirable. First, the method should
have "face" validity in the eyes of those who will use the resulting
measurements; and since individuals have varying concepts of
what constitutes "ideal" indexing, the method should allow one
the option of choosing a criterion concept that reflects his own
values rather than begging the question of what the "right" concept
is by building it into the method. Second, the method should
be practical, in terms of time and effort required, for routine or
everyday use by small and large services as well as for one-time
studies aimed at obtaining "definitive" measurements. Third, if
the measurements obtained are to serve as a basis for decisions,
one should know how much confidence they merit (that is, their
reliability, or reproducibility, should be statistically
determinable), and this reliability should be adequate to warrant
basing important decisions on the measurements. Fourth, the method
should be flexible in that it can accommodate different types of
indexing: for example, "keyword" indexing with no restrictions
on allowable terms, subject headings controlled by an authority
list with or without hierarchical structure, indexing done by people
or by machine, etc. Fifth, it should be sensitive enough to detect
differences in the relative merit of indexing produced by the
alternative procedures or agents those using the method may wish to assess.

Collectively, these five general desiderata (face validity,
practicability, reliability, flexibility, and sensitivity) represent a
stringent set of requirements a truly general method should meet. In
any particular application, of course, there must always be trade-offs
between validity and practicality, and between reliability and
practicality; however, it should be possible to achieve compromises that
are acceptable. This study aimed at exploring the methodologic
variables that govern the trade-offs required and influence the
method's flexibility and sensitivity.

* The development of this method was described in: Schultz, Claire
K., Schultz, Wallace L., and Orr, Richard H., "Comparative indexing:
terms supplied by biomedical authors and document titles." American
Documentation 16, 4 (October 1965), pp. 299-312.

# The rationale underlying the development of the method is given
in Appendix A.

Organization of Report

In the succeeding sections of this report, we will describe the
basic operations required to apply the general method, give the
results of trials and analyses designed to explore critical
methodologic variables, discuss the implications of these findings as
they relate to the desiderata set forth above, and offer some
conclusions regarding the method's potential range of applications. For
clarity of presentation, all subsidiary detail will be relegated to
the appendices.
ESSENTIALS OF METHOD

In the simplest terms, the criterion-group method can be
described as follows: for each document in the test corpus a set
of terms characterizing that document is first established by
merging all terms chosen by the members of a criterion group,
each of whom makes his choices independently. This indexing
set is then considered the standard (criterion set) against which
[...]. In this method the terms in the sets to be tested are not scored on
a black-or-white scale; that is, they are not simply rated as
"matching" or "not matching" the terms in the criterion set. Our
scale allows for as many shades of gray as there are members of
the criterion group. Conducting a test requires six basic operations:



Selecting the document sample

In any specific application of the method, the documents for
which indexing is to be evaluated should be a representative
sample of the document universe of interest. This sample may be
[...] any of the usual sampling procedures
based on random selection. When the sample to be used is large, a
simple random sampling procedure can be used; however, for small
samples, a stratified random sample may be preferable. For this
[...] important variable is the size of the sample,
which should be large enough to provide the reliability needed
[...] particular purpose. On the other hand, since the number
of documents is a major factor in determining the effort and
expense of running a test, this number should be no larger than
[...]


Selecting and instructing the criterion group

What type of individuals should constitute the criterion group
depends upon one's concept of "ideal" or "standard" indexing and
the purpose to be served. In our original study* the aim was to
test how well author-supplied indicia matched the language of
potential users; therefore, a group of the authors' peers served as
the criterion group. However, it might be considered appropriate
for the criterion group to consist of "expert" indexers selected
on some basis for the quality of their work. Ideally, from whatever
universe the criterion group is drawn, the selection procedure
should insure that the group is representative of that universe,
but practical constraints may require one to settle for selecting
members of the group by non-random procedures. Other things being
equal, the larger the group the more likely it will be
representative; and a universe that is relatively homogeneous can be
adequately represented by a smaller criterion group than a universe
[...] heterogeneous. The size of the criterion group, like the size
[...] of using the method; therefore, this variable [is]
also an important determinant of practicality.

Another variable in this operation is how the group is instructed
to carry out its task, including whether they are given any sort
of a terminology "guide" explicitly or implicitly intended to
structure their responses.

* Schultz, Claire K., Wallace L. Schultz, and Richard H. Orr.
Comparative indexing: terms supplied by biomedical authors and
document titles. American Documentation 16, 4 (October, 1965),
pp. 299-312.

Instructing test indexers

In any application where an individual, a group of individuals, or
a machine indexes documents for the specific purpose of testing
the resulting indicia, instructions or rules on how to carry out the
task will have to be given. These instructions may or may not be
equivalent to those given the criterion group. In applications
where the indexing to be tested has been produced as part of an
on-going service, this variable does not represent a test "option".

Again, if one desires to generalize from the findings regarding
the quality of the tested indexing to some larger universe, the
question of representativeness arises; then, the method of selection and
size of the group require careful consideration.

Establishing criterion and test sets

If either the criterion group or the test indexers are allowed
to use free language*, a decision is required on whether their
output should be edited, or standardized, before criterion and test sets
are compared; and if standardizing is done, what rules should be
followed. Without standardization, synonyms and trivial variations
(for example, singular and plural forms of the same term) will be
counted as different terms. However, any editing increases the cost
of a test; and all human editing is prone to inconsistencies and
biases that may affect the reliability and the validity of test
results.

* Or if machine indexing is to be tested.

Weighting the criterion sets

In this method, some scheme is required for weighting the terms
used for indexing a document to reflect the consensus that exists
among the criterion group with respect to appropriate indexing
terms for that document. Many schemes could be employed, but
perhaps the simplest is to weight each term in the criterion set by the
number of criterion group members who used it to characterize the
document and to give any term not used by at least one member of the
criterion group (that is, any term not in the criterion set) a weight
of zero to indicate its "undesirability". Alternative schemes can be
devised that will increase or decrease the effect of consensus and
will change the "penalty" for using terms that are not in the
criterion set. (See Appendix D for details on weighting and an example of
an alternative scheme.)
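
As a minimal sketch of the simplest weighting scheme described above, the
following Python fragment merges the term sets of a criterion group and
weights each term by the number of members who used it. The names
`criterion_weights` and `weight_of`, the `power` parameter, and the sample
terms are illustrative assumptions and do not come from the original
procedures; setting `power=2` squares the count, as one example of a
scheme that increases the effect of consensus.

```python
from collections import Counter


def criterion_weights(member_term_sets, power=1):
    """Weight each term by how many criterion group members used it.

    member_term_sets: one collection of terms per criterion group member.
    power: hypothetical knob; 1 gives the simple count, 2 squares it.
    """
    counts = Counter()
    for terms in member_term_sets:
        counts.update(set(terms))  # a member counts at most once per term
    return {term: n ** power for term, n in counts.items()}


def weight_of(term, weights):
    # Terms absent from the criterion set receive a weight of zero.
    return weights.get(term, 0)


# Hypothetical term sets from a criterion group of three members.
members = [
    {"enzyme", "kinetics", "liver"},
    {"enzyme", "liver"},
    {"enzyme", "metabolism"},
]
weights = criterion_weights(members)
print(weights["enzyme"], weights["liver"], weight_of("kidney", weights))
```

Keeping the weights in a plain dictionary means a term outside the
criterion set needs no special handling beyond the zero default.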

Scoring the test sets

The weights thus established are employed to score each test set
by adding the weights for each term in the set. The "raw score"
for a test set is then standardized by expressing it as a percentage
of the highest score possible for that set, or the "variable score",
which is determined by the sum of the weights for all terms in the
criterion set. Thus if a test set scores 0%, it means that no term
in the set was used by any member of the criterion group; and a score
of 100% means that the test set contains all the terms used by the
criterion group collectively.
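
The scoring rule just described can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not the report's manual
or computer procedure: the function name `percent_maximal_score` and the
sample weights are hypothetical, and the weights are taken to be the
simple counts of criterion group members using each term.

```python
def percent_maximal_score(test_terms, weights):
    """Raw score of a test set as a percentage of the highest possible score.

    weights: mapping of criterion set terms to their weights; any term
    missing from the mapping contributes zero.
    """
    raw_score = sum(weights.get(term, 0) for term in set(test_terms))
    highest_possible = sum(weights.values())
    if highest_possible == 0:
        return 0.0
    return 100.0 * raw_score / highest_possible


# Hypothetical criterion set weights for one document.
weights = {"enzyme": 3, "liver": 2, "kinetics": 1, "metabolism": 1}

partial = percent_maximal_score({"enzyme", "liver", "kidney"}, weights)
complete = percent_maximal_score(weights.keys(), weights)
no_match = percent_maximal_score({"kidney", "rat"}, weights)
print(round(partial, 1), complete, no_match)
```

In this toy case the test set earns 5 of a possible 7 points, and the
two boundary cases reproduce the 100% and 0% interpretations given in
the text.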

When the criterion group consists of potential users, the percent
maximal score is analogous to Cleverdon's "recall" measure; and if
desired, a supplementary figure of merit analogous to his "precision"
measure may also be calculated by taking into consideration the
frequency with which terms not in the criterion sets (non-scoring or
"zero terms") appear in the test sets.* (See Appendix D for
details on scoring.)

* The relation of measures derived by this method to other measures
of indexing performance is suggested in Appendix A. A full
discussion of these relations is outside the scope of this report.
In this operation, one may wish to give some credit for test set
terms that, although not identical to terms in the criterion set, are
subsumed by criterion set terms in a given indexing vocabulary. The
method allows the option of dealing with such mismatches by
"confounding or generic posting" before scoring the term sets.* This
complicates scoring and hence increases the cost of a test; but it may
be appropriate in some applications.

* Alternatively, generic-specific transformations as well as
standardization of synonyms may be done in the editing operation.



FINDINGS ON METHODOLOGIC VARIABLES
This study focused on six of the methodologic variables
selected from those identified above. These six variables were
selected because, for a priori reasons, we felt they could be major
determinants of the method's practicality, flexibility, reliability,
and sensitivity, and because they could be investigated without
establishing a new document corpus. To explore the effects of these
variables, we carried out special analyses of the data obtained in
the original application of the method and also conducted trials to
obtain new data bearing on these variables. The major findings are
summarized and discussed below. Details on the materials, subjects,
and manual and computer procedures referred to are given in the
Appendices.

Variable 1 -- Size of document sample

An indication of how the reliability of test scores depends on
the size of the document sample used for a test is provided by the
standard deviation of the % maximal scores for individual documents
from the mean score for all documents in the test sample. We
found that the sample standard deviation is moderately affected by
other methodologic variables. Assessing the effects of each
variable on reliability singly and in combination with other variables
was not feasible; however, the effects of the technique used for
editing term sets (Variable 4) and of the scheme employed for
weighting before scoring (Variable 5) were explored and will be
discussed later in connection with these variables. For the studies
reported in this section and in the section devoted to criterion
group size (Variable 2), the editing technique and weighting scheme
remained constant.*

When sets of terms produced by 8 professional indexers for
each document were scored against the set of terms supplied by the
criterion group of 12 collectively, the standard deviation of
scores for term sets averaged over the 8 indexers from the grand
mean for a sample of 128 documents was 17 points (% of maximal
score). In a sample of 32 documents, the corresponding standard
deviations for scores of term sets produced by individual
indexers ranged from 16 to 20. For term sets supplied by authors, the
standard deviation of individual term set scores from the mean for
256 documents was 17 points. Figure 1 gives the confidence limits
for mean scores based on different sample sizes when the observed
sample standard deviation of 17 points is taken as an estimate of
the standard deviation for the document population from which the
samples were drawn.

* Computer editing and weighting scheme #1 were employed throughout.
The random variation in scores attributable to documents will,
of course, depend upon the heterogeneity of the document population
from which the sample was drawn; and since the present studies were
limited to our document corpus, one cannot say the observed sample
variation is a good estimate of the variation that will be
encountered in applications of the method with other document populations.
However, these findings should provide at least a rough idea of the
general size of document sample required in applications where it
is important to measure the effectiveness of a given indexing
"treatment" within specified confidence limits. It can be seen that where
there is a need for relatively precise measurements, e.g., within
±5 points at the 95% confidence level, samples of [?]0 to 100
documents will probably suffice unless variation in the document
population is considerably greater than in our corpus. For many
applications, this degree of precision will not be necessary and useful
results can be obtained with considerably smaller samples; for
example, where a rough estimate (± 10 points with 80% confidence) can
be useful, samples of 10 documents may suffice.
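
The relation between sample size and confidence limits discussed above
can be approximated with a short calculation. The sketch below assumes
a normal approximation for the mean score and takes the reported 17-point
standard deviation as the population value; the report does not state how
the limits in Figure 1 were computed, so the function names and the choice
of a normal rather than a t distribution are assumptions made here for
illustration only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def half_width(sd, n_docs, confidence):
    """Half-width of a two-sided confidence interval for a mean score."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * sd / sqrt(n_docs)


def docs_needed(sd, limit, confidence):
    """Smallest sample size whose half-width does not exceed `limit`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return ceil((z * sd / limit) ** 2)


SD_POINTS = 17.0  # sample standard deviation reported in the text

print(round(half_width(SD_POINTS, 10, 0.80), 1))  # 10 documents, 80% level
print(docs_needed(SD_POINTS, 5.0, 0.95))          # within 5 points, 95% level
```

Because this treats document variation as the only source of error, it
will tend to give narrower limits than a calculation that also allows for
variation among criterion group members.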

Managers of indexing services and researchers attempting to
develop indexing "systems" often need tests to indicate whether
two indexing treatments give significantly and materially different
results. For such uses, tests with small samples should provide an
adequate basis for working decisions on matters where the cost of
being wrong is not great. The use of small sample tests that take
advantage of the reduced variability achieved by employing the same
sample to test two different treatments will be illustrated later.

Variable 2 -- Size of criterion group

One could assess the effect of this variable directly by seeing
how the scores of a given indexing treatment for a given document
sample change as the number of individuals in the criterion
group increases. However, when scoring is done manually and the
document sample is of any size, the work required for each
increment in the size of the criterion group imposes severe limitations
on this approach. For this reason, in our original project, we
were only able to assess this variable crudely by comparing scores
based on half of our criterion group of 12 scientists with scores
based on the other half. With the development of a computer
program for scoring, systematic assessment of the effect of criterion
group size became feasible; however, the cost of a definitive study
was still material, so we considered alternative approaches that
would be more economical and also be useful for an unfinished study
of the characteristics of individual scientists that may influence
how effective indexing is for them. Although in this method
definitive scoring of a test set of indexing terms is based on
comparisons with a composite criterion set established by merging the
terms used by each member of the criterion group to describe a
given document, we have demonstrated empirically that the score
based on a composite criterion set can be usefully approximated
under certain conditions by averaging scores for a test set based
on individual criterion sets, consisting of the terms used by each
criterion group member individually.* This suggested another
approach to assessing the effect of criterion group size utilizing
analysis of variance techniques. Details of these analyses would
[...] findings relating to the
effect of criterion group size will be summarized very briefly.

These analyses indicate that an appropriate model for
present purposes is one in which the total variance in scores is
partitioned into 3 additive components attributable to document
variance, criterion group variance, and residual error. When
[criterion group variance is held] constant, this model gives the
same estimate for document variance as that obtained by
"experimental", or direct, determination of document sample standard
deviation reported earlier. When document variance is held
constant, the model gives an estimate for criterion group variance
centered around 121 points (standard deviation, 11 points).

The effect of sampling error attributable to this source on
[...] is shown in Figure 2 where the confidence
limits are calculated from this estimated variance.
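
To see how the two reported variance components might jointly limit
precision, one can treat them as independent and add their contributions
to the variance of a mean score. The sketch below does this with the
reported standard deviations of 17 points (documents) and 11 points
(criterion group). It is an assumption-laden illustration, not the
authors' analysis: it ignores the residual error term, uses a normal
approximation, and the name `combined_half_width` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

DOC_SD = 17.0        # document standard deviation reported in the text
CRITERION_SD = 11.0  # criterion group standard deviation reported in the text


def combined_half_width(n_docs, n_criterion, confidence,
                        doc_sd=DOC_SD, crit_sd=CRITERION_SD):
    """Confidence half-width when both sampling sources are combined.

    Assumes independent, additive components and no residual error term.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    variance_of_mean = doc_sd ** 2 / n_docs + crit_sd ** 2 / n_criterion
    return z * sqrt(variance_of_mean)


# Two design points mentioned in the report's summary.
print(round(combined_half_width(10, 4, 0.80), 1))
print(round(combined_half_width(20, 9, 0.95), 1))
```

Under these assumptions, enlarging only the document sample or only the
criterion group gives diminishing returns, since the other component
then dominates the variance of the mean.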

On an a priori basis, one would expect criterion group
variance to depend upon the heterogeneity of the population the
group represents. It has not been feasible to test this
hypothesis systematically; however, we have scored author-indexer
[...] another criterion group, the 8 professional
indexers. Rather surprisingly, the same estimate of criterion
group variance was obtained. These indexers also constitute a
relatively heterogeneous group in that their approaches to
indexing reflect a variety of different indexing services.

Figure 2 indicates that, when precise estimates of indexing
[...] critical, the criterion group will probably
have to be sizable if one is to have much confidence that they are
adequately representative of some larger population. For our
original application of this method, it was important to include
enough people in the criterion group that we could be reasonably
certain another sample from the user population they represented
would not give materially different scores for the indexing
treatments we wanted to assess. Without a guide as to how many would
be "enough", we therefore made the sample as large as we could
within practical constraints. In any application where members of
the criterion group are supposed to represent some large population,
how large the group should be is a critical consideration since
this variable is a major determinant of the overall cost of
employing the method. For other applications, however, the
representativeness of the criterion group is irrelevant: for example, where
one can identify a few "expert" indexers and consider their "output"
as a valid standard. If a criterion group were selected from the
best indexers working for a single service, it seems reasonable
to predict that their variance will be materially smaller than that
found in the two groups we studied and that a group of 3 or 4 will
probably be optimal. Even where the criterion group is supposed
to represent some larger population, there are numerous potential
applications where high precision is not essential and a criterion
group of less than 10 members will probably suffice, where only
rough estimates are required or the need is for a quick test to
guide the kind of working decisions discussed in connection with
document sample size.

* The conditions under which scoring based on individual criterion
[...] scoring based on composite criterion sets are
complex and have not been completely explored; however, numerous
trials have shown that, when weighting scheme #1 is employed, the
approximation is good at least for term sets supplied by our
original criterion group of scientists.

Variable 3 -- Instructions to test indexers

Whether the method could accommodate indexing done without
any vocabulary guide, such as the author-indexing form employed in
the original application, was an important question concerning the
method's flexibility; and whether the method could detect
differences in indexing produced by asking indexers to follow different rules
had a bearing on its sensitivity. Both of these questions were
explored in numerous small-scale experiments, in which different types
of subjects (individuals with and without indexing experience, and
with and without biomedical knowledge) were asked to index document
samples under trial conditions. The kind of evidence these
experiments provided relating to the two questions can be illustrated by
the results of one series of experiments, which is summarized in
Table I. With no guide and no explicit rules, the mean scores for
Group A and Group B on the 10 documents in subsample X (32% vs.
[...]) were, as one would expect, not significantly different.* There were
no problems in standardizing and scoring test sets under such
unstructured conditions, and the mean score for Group B remained
stable when Rule 1 was imposed for their second subsample (Y);
this rule may be considered a control in that it was not expected
to make a difference. However, when Rule 2 was imposed for their
third subsample, a significant difference (99% confidence) in mean
scores resulted. The implications of trials with Group A are less
clear cut since, for comparisons of interest in the present context,
the variables are confounded. These trials, in conjunction with
evidence provided by similar experiments with other types of
subjects, indicate that the method is indeed flexible enough to
accommodate indexing done without a vocabulary guide and that it is
sensitive enough to detect the effect of indexing rules and indexing aids.
In addition, these illustrate how small document samples may suffice
for some applications.

* The signed-rank (Wilcoxon) test was employed to test the significance
of the observed difference. Hereafter, all statements concerning the
significance of differences are based on the signed-rank test if the
same subsample of documents was employed for both indexing "treatments";
the rank test (variously called the Wilcoxon T test or the Mann-Whitney
U test) was used when the document subsamples differed.

Table I. Trials of Indexing Rules and Aids with Medical Student Subjects
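
For readers unfamiliar with the signed-rank test used for comparisons on
the same document subsample, the following sketch computes the test
statistic for paired scores. The scores shown are invented for
illustration and are not data from this study; the function name
`signed_rank_statistic` is likewise hypothetical. The smaller of the two
rank sums would then be compared with a tabulated critical value for the
number of non-zero differences.

```python
def signed_rank_statistic(scores_a, scores_b):
    """Return (W_plus, W_minus, n) for paired scores on the same documents."""
    # Zero differences are dropped, as in the usual form of the test.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus, len(ordered)


# Hypothetical % maximal scores for two treatments on the same 10 documents.
treatment_1 = [30, 42, 25, 38, 31, 45, 29, 36, 40, 33]
treatment_2 = [28, 35, 27, 30, 31, 39, 24, 37, 31, 29]
result = signed_rank_statistic(treatment_1, treatment_2)
print(result)
```

Pairing scores by document removes the document-to-document variation
from the comparison, which is why small subsamples can still reveal a
difference between treatments.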

Variable 4 -- Procedures for editing criterion and test sets

Three different procedures for editing were assessed for their
general effects on test sensitivity. The procedures were as
follows:

No Editing: Completely unedited test sets were compared with and
scored against the unedited term sets of the criterion group on a
word-by-word basis. This meant that where "nonsubstantive" words,
such as "in" and "of", which were present in an unedited criterion
set, matched words in a test set, scoring credit was given. On the
other hand, no scoring credit was given if a test set word failed
to match a "substantive" word in the criterion set because of a
slight orthographic difference, e.g., a word ending.

Computer Editing: The computer file of thesaurus rules edited both
the criterion sets and test sets to eliminate most "nonsubstantive"
words and to standardize word endings.

Manual Editing: A human editor attempted to apply to criterion and
test sets the thesaurus rules incorporated in the computer editing
program; however, this was done largely by memory and the editor
undoubtedly considered a wider range of contexts than was available
to the computer. For example, terms that did not match because of
misspelling were credited by the human editor.
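
The kind of standardization attributed to computer editing above can be
illustrated with a toy sketch. The word list and ending rules below are
hypothetical stand-ins chosen for brevity; they are not the thesaurus
rules used in the study, which are not reproduced in this section, and
the function names are assumptions made for this example.

```python
# Hypothetical "nonsubstantive" words; the study's actual list is not given here.
NONSUBSTANTIVE = {"a", "an", "and", "for", "in", "of", "the", "to"}


def standardize_word(word):
    """Lower-case a word and apply deliberately crude ending rules."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # "studies" -> "study"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # "diseases" -> "disease"
    return word


def edit_term_set(terms):
    """Split terms into words, drop nonsubstantive words, standardize endings."""
    edited = set()
    for term in terms:
        for word in term.split():
            if word.lower() not in NONSUBSTANTIVE:
                edited.add(standardize_word(word))
    return edited


criterion = edit_term_set(["Diseases of the liver"])
test_set = edit_term_set(["liver disease"])
print(sorted(criterion), sorted(test_set))
```

Applying the same rules to both criterion and test sets is what matters
for matching; the rules need not produce linguistically correct stems as
long as variants of one word are mapped to the same form.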
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Initial exploratory trials with samples of 8 documents sugges- 
ted THAT AS COMPARED TO NO EDITING, BOTH COMPUTER AND MANUAL EDITING 
INCREASED THE METHOD'S ABILITY TO PICK UP DIFFERENCES IN SCORES 
GIVEN BY DIFFERENT INDEXING TREATMENTS OF THE SAME DOCUMENTS — FOR 
EXAMPLE, PROFESSIONAL INDEXERS VS. AUTHOR INDEXERS. HOWEVER, LATER 
TRIALS CONDUCTED WITH SAMPLES OF 32 DOCUMENTS INDICATED THAT,' IN 
THIS REGARD, ANY ADVANTAGE OF THESE PROCEDURES OVER NO EDITING WAS 
RELATIVELY SMALL. SoME OF THE CRITICAL COMPARISONS IN THE LATER 
TRIALS ARE SHOWN IN TaBLE II.* ThE PRINCIPAL EFFECT OF EDITING IS 
TO INCREASE SCORES FOR ALL INDEXING TREATMENTS AND THIS INCREASE IS 
SOMEWHAT MORE MARKED WITH MANUAL EDITING THAN COMPUTER EDITING) 
HOWEVER, THE DIFFERENCES BETWEEN MEAN SCORES FOR TWO DIFFERENT INDEX- 
ING TREATMENTS IS NOT UNIFORMLY INCREASED. |N ADDITION, THE STANDARD 
DEVIATIONS, WHICH ALSO AFFECT SENSITIVITY, ARE GENERALLY INCREASED BY 
BOTH COMPUTER AND HUMAN EDITING PROCEDURES. |T IS OF SOME INTEREST 
TO NOTE THAT THE COMPUTER PROGRAM QUITE SUCCESSFULLY SIMULATED A 
HUMAN EDITOR) THE MEAN SCORE OF ALL PROFESSIONAL INDEXER TEST SETS . 
OVER 32 DOCUMENTS WAS 3^ (STANDARD ERROR, l.l) WHEN EDITED BY COM- 
PUTER, AS COMPARED TO '^6 (STANDARD ERROR, I. 5 ) WHEN THE SAME TEST 
SETS WERE MANUALLY EDITED. 

Although the fact that without editing non-substantive words 

PRESENT IN THE CRITERION SET ARE COUNTED IN SCORING MAY OFFEND ONE’s 
INTUITIVE SENSE OF TEST VALIDITY, THE FINDINGS SEEM TO INDICATE THAT 
EDITING MAKES A RELATIVELY SMALL CONTRIBUTION TO TEST SENSITIVITY. 

Both human and computer editing are relatively costly: the former
should be done by experienced indexers, and the latter is definitely
uneconomic unless large volumes of test sets are to be processed or
a suitable thesaurus program has already been written. If editing
is omitted, the remaining operations can be carried out by clerical
personnel. This is a practical consideration that may be important
for some applications. Having the options of no editing, computer
editing, or human editing increases the method's flexibility and
range of potential applications.



* Each of the contrasts shown was already known to be significant
from large-sample tests with 128 to 282 documents, but the
differences were of an order that might pose a "challenge" for
small-sample tests.
Table II. Comparisons to assess effect of editing procedures
on test sensitivity

                                   Editing procedure
                          None         Computer     Manual
                          Mean (S.D.)  Mean (S.D.)  Mean (S.D.)

Professional indexers vs. author sets
  Professional            26 (8)       34 (10)      36 (9)
  Author                  34 (11)      37 (12)      54 (16)
  Difference              S*           S            S*

Author vs. title sets
  Author                  (not done)   (not done)   54 (16)
  Title                                             34 (17)
  Difference                                        S*

* S = significant at the 80% level or higher; where there is an
asterisk, the difference was also significant at the 95% level or
higher.

All scores are % maximal scores (weighting Scheme 1) on the same 32
documents, and the standard deviations in percentage points are
given in parentheses after each score. Standard deviations were
calculated as described earlier in the section devoted to the
effect of document sample size.
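
As a rough aid to reading the comparisons in Table II, a contrast
between two indexing treatments can be expressed as the difference
between their means relative to the spread of the scores. The sketch
below is illustrative only: the function name and the pooling of the
two standard deviations by root mean square are assumptions introduced
here, and this ratio is not the significance test used in the study.
The figures are the professional-indexer and author-set means and
S.D.s under the three editing conditions.

```python
from math import sqrt


def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Absolute difference of two means divided by the RMS of their S.D.s.

    This pooling rule is an assumption made for illustration.
    """
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# (mean, S.D.) for professional indexer sets and author sets,
# weighting Scheme 1, 32 documents.
contrasts = {
    "none": ((26, 8), (34, 11)),
    "computer": ((34, 10), (37, 12)),
    "manual": ((36, 9), (54, 16)),
}

ratios = {
    editing: standardized_difference(*professional, *author)
    for editing, (professional, author) in contrasts.items()
}

for editing, ratio in ratios.items():
    print(f"{editing:9s} {ratio:.2f}")
```

Under this assumed measure the ratios come out at about 0.83 with no
editing, 0.27 with computer editing, and 1.39 with manual editing,
which is in line with the observation that editing raises scores
without uniformly increasing the separation between treatments.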
Variable 5 -- Weighting scheme for scoring

Trials of two different weighting schemes were conducted primarily
to determine whether test sensitivity was affected by this variable.
In Scheme 1, terms in test sets are weighted by the frequency with
which they were used by the members of the criterion group in
describing the given document; whereas in Scheme 2, which was the one
employed in the original application, the square of this frequency is
used for weighting. It can be seen that the latter scheme places much
greater emphasis on "popular" criterion group responses. Like editing,
Scheme 2 has the effect of raising the scores of most test sets and
generally increases the standard deviations in comparison with
Scheme 1; however, it also commonly increases the differences between
mean scores for different indexing treatments. The resulting effect
on test sensitivity is complex. As one example, with Scheme 1 mean
scores for computer-edited professional indexer sets vs. author sets
are 34 (S.D., 10) vs. 37 (S.D., 12); whereas with Scheme 2 the
corresponding values are 49 (S.D., 16) vs. 59 (S.D., 18). For this
contrast, the advantage of Scheme 2 is apparent, but not marked. On
the other hand, for the contrast between manual-edited author sets
and title sets, Scheme 2 is greatly superior: 54 (S.D., 16) vs. 34
(S.D., 17) with Scheme 1, as compared to 74 (S.D., 16) vs. 35
(S.D., 26) with Scheme 2. Since weighting by Scheme 2 entails a
relatively small increment in effort over what is required with
Scheme 1, it may be a useful option in some circumstances.
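
The two weighting schemes can be sketched in a few lines. This is a
minimal illustration, not the study's scoring program: the function
name, the sample terms and counts, and in particular the choice of
the maximal score as the sum of weights over every term in the
criterion set are assumptions made here.

```python
def percent_maximal_score(test_terms, criterion_counts, scheme=1):
    """Score a test set against a criterion set as a % of the maximum.

    criterion_counts maps each criterion term to the number of
    criterion-group members who used it for the document.
    Scheme 1 weights a term by that frequency, Scheme 2 by its square.
    Assumption: the maximal score is the sum of all criterion weights.
    """
    if scheme not in (1, 2):
        raise ValueError("scheme must be 1 or 2")
    weights = {term: count ** scheme for term, count in criterion_counts.items()}
    earned = sum(weights.get(term, 0) for term in set(test_terms))
    maximal = sum(weights.values())
    return 100.0 * earned / maximal


# Hypothetical criterion set from a 12-member group, and a test set that
# matches only the most popular criterion term.
criterion_counts = {"insulin": 10, "glucose": 6, "rat": 2}
test_terms = ["insulin", "liver"]

score_scheme_1 = percent_maximal_score(test_terms, criterion_counts, scheme=1)
score_scheme_2 = percent_maximal_score(test_terms, criterion_counts, scheme=2)
print(round(score_scheme_1, 1), round(score_scheme_2, 1))
```

In this made-up case the score rises from about 55.6 under Scheme 1
to about 71.4 under Scheme 2, because squaring the frequencies gives
more of the total weight to the term most of the group agreed on.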



We considered weighting schemes that would "penalize" overassignment
of terms more heavily than either Scheme 1 or Scheme 2; for example,
by giving a negative weight to terms in test sets that were not
used by any member of the criterion group. However, the schemes
considered had numerous technical disadvantages; and since the
% maximal score divided by the total number of terms in the test set
can serve as a measure of indexing efficiency, as contrasted with
effectiveness, this matter has not been pursued further.
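
The efficiency measure mentioned above is simple arithmetic, shown
here as a small sketch. The function name and the two sample test
sets are hypothetical; only the ratio itself (% maximal score divided
by the number of terms in the test set) comes from the text.

```python
def indexing_efficiency(percent_maximal_score, n_terms_in_test_set):
    """% maximal score earned per term assigned in the test set."""
    if n_terms_in_test_set <= 0:
        raise ValueError("a test set must contain at least one term")
    return percent_maximal_score / n_terms_in_test_set


# Two hypothetical test sets that are equally effective (60% of maximum)
# but differ in how many terms were assigned to reach that score.
sparing = indexing_efficiency(60.0, 4)
lavish = indexing_efficiency(60.0, 12)
print(sparing, lavish)
```

The set that reaches the same score with fewer terms has the higher
efficiency, which is how overassignment shows up without a negative
weight.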
Variable 6 -- Confounding before scoring

As an exploratory trial of the effect of confounding, i.e.,
generic posting, all professional indexer and author sets for 32
documents were rescored after adding to each test set any terms
shown by the vocabulary guide as generic to terms in the original
test set. Scoring credit was then given to such added terms when
they matched criterion set terms. Confounding increased the grand
mean for the professional indexer sets by 6 points, and the mean
for author sets was also increased by 6 points; in both cases, the
standard deviation was unchanged. These findings suggest that
confounding has little or no effect on test sensitivity, which was
the main question prompting the trial. Confounding, however, may have
an advantage for certain applications, e.g., in tests in the
context of an indexing system that employs hierarchical structure,
and where it may be desirable to make more comparable indexing
done at different levels of specificity. If a hierarchical
organization of indexing terminology has been created prior to
scoring, the process can be carried out during either manual or
computer scoring at a relatively low cost. Confounding therefore
represents a useful option that increases the method's flexibility.
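
Confounding as described above amounts to expanding each test set
with its generic terms before scoring. The sketch below assumes the
vocabulary guide has been transcribed into a lookup from each term to
all of its generic terms; the function name and the small hierarchy
are illustrative and are not taken from the study's programs.

```python
def confound(test_terms, generic_terms):
    """Return the test set plus every term listed as generic to its members.

    generic_terms maps a term to the broader terms it should be posted
    under (assumed to be listed in full for each term).
    """
    expanded = set(test_terms)
    for term in test_terms:
        expanded.update(generic_terms.get(term, ()))
    return expanded


# Illustrative fragment of a hierarchical vocabulary guide.
generic_terms = {
    "Diabetes mellitus": ["Insulin", "Endocrines"],
    "Insulin": ["Endocrines"],
}

original_set = {"Diabetes mellitus", "Rat"}
confounded_set = confound(original_set, generic_terms)
print(sorted(confounded_set))
```

The confounded set is then scored in the usual way, so an added
generic term earns credit only if it also appears in the criterion
set.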



CONCLUSIONS 

From the results of these studies of the methodologic variables,
we have concluded that the consensus-group method of evaluating
indexing can be a practical yardstick for a wide variety of
managerial, research, and educational uses.








APPENDIX A



RATIONALE OF METHOD AND LITERATURE REVIEW 

A MODEL OF INFORMATION RETRIEVAL 

The following diagram, which has been freely adapted from Kyle*
and Hyslop, represents a "simplified" model of the chain of
processes in an "information retrieval" system. This model can
accommodate any system in which documents are indexed prior to the
receipt of queries, whether the indexing is done by people, machines,
or man-machine combinations.**

* In the following pages, only one reference is cited on most points;
often several other references would be equally appropriate. Selection
of the one used as an example was usually fortuitous and is not meant
to imply attribution of priority or novelty.

** The operations of coding, filing, and matching are omitted here; the
"short-circuit" is symbolized by the dotted line between the topmost
boxes.
In this model, indexing is depicted as a two-step process
(circles labelled 1 and 2). The operations performed during the
first step, selection, determine which "aspects" of the document
will be represented in the index. (This step is also called
"concept analysis", "document analysis", "detection", and
various other names.) The output of selection is a set of entry
terms. (Synonyms for entry terms include, among others, "entry
expressions", "detection terms", "clue words", "indicator
words", and "candidate terms".) In the second step, the
entry terms are translated into a set of index terms. (This
translation is commonly referred to as "standardization" or
"vocabulary control".) The distinction between the two steps was
aptly described by Kyle as the difference between "what to index"
and "how to index it".

In systems where no attempt is made to control the number of
different terms that may appear in the index (for example, systems
using KWIC indexing, or "pure" Uniterms), the translation step
is, of course, missing; entry terms are index terms. Since few
systems require indexers to record entry terms routinely, indexing
may also appear to be a one-step process in many systems where index
terms are controlled. In such cases, however, it is reasonable to
postulate that the two steps occur in the indexer's mind, even
though there is evidence to suggest that professional indexers may
sometimes think directly in controlled indexing language when
deciding which aspects of a document should be represented in the
index. Despite the fact that it is often difficult to separate
cleanly the selection and translation steps, the distinction
is very useful in analyzing the indexing process because selection
poses theoretical and practical problems of a different order of
difficulty than those of translation.



The importance of the selection step

Cleverdon and others have postulated that, given a well-developed
"indexing language",* the translation step can be reduced to a
clerical or machine-like routine, whereas the selection step is an
intellectual task. The fact that translation was successfully
automated in 1963, and that computer programs to accomplish the
translation step have since been integrated into several operating
services, as well as being demonstrated (as contrasted to simulated)
in experimental trials such as Artandi's, indicates the validity of
the position that the problems of selection are of a different order
than those of translation.

* As a minimum, an indexing language includes a set, or vocabulary, of
entry terms; a set of index terms; and rules for translating from one
set to the other. Indexing languages may have various other elements,
but these three are essential.
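
The three essential elements named in the footnote (entry terms,
index terms, and translation rules) are what make the translation
step a clerical or machine-like routine. A minimal sketch, assuming
the rules can be written as a simple lookup table; the function name
and every term shown are hypothetical examples rather than entries
from any actual indexing language:

```python
def translate(entry_terms, rules):
    """Translate entry terms into index terms by table lookup.

    Returns the index terms (duplicates removed, order kept) and the
    entry terms for which the indexing language has no rule.
    """
    index_terms, untranslated = [], []
    for term in entry_terms:
        index_term = rules.get(term.strip().lower())
        if index_term is None:
            untranslated.append(term)
        elif index_term not in index_terms:
            index_terms.append(index_term)
    return index_terms, untranslated


# Hypothetical translation rules: entry term -> controlled index term.
rules = {
    "sugar diabetes": "Diabetes mellitus",
    "diabetes": "Diabetes mellitus",
    "adrenalin": "Epinephrine",
}

index_terms, untranslated = translate(
    ["Sugar diabetes", "Adrenalin", "diabetes", "hibernating bats"], rules
)
print(index_terms, untranslated)
```

Nothing in the lookup decides which aspects of the document deserve
an entry term in the first place; that decision is the selection step
and is not captured by such a table.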
Selection is not only the more challenging step; the
effectiveness with which selection is carried out also fixes the upper
limits on the performance of the entire chain of processes in an
information retrieval system. Again, it was Cleverdon who, whether
he was the first to do so or not, can be credited with emphasizing
this important point and making it convincing. He pointed out that
the maximum performance any given system is capable of, with regard
to "recall" and "precision",# depends upon how completely and
specifically all the "concepts" in the documents have been indexed.
Since the completeness (or "exhaustivity") and the specificity of
the index terms for a document can be less, but no greater, than the
completeness and specificity of the entry terms selected for that
document, it follows that how the selection step is done determines
the highest level of performance a given system can provide;
variations in the effectiveness of the translation step, of the
indexing language itself, and of all other processes and components
in the system can only lower system performance below this
theoretically attainable level. In other words, good selection is a
necessary but not sufficient condition for good performance.

We have learned how to use machines in operating systems to
execute, tirelessly and without error, the translations specified by
an indexing language; we are beginning to learn how to design and use
indexing languages so that either recall or precision can be
emphasized, depending on what the requestor wants; and marked progress
has been made in improving coding and filing, the final processes on
the indexing side of the information retrieval "chain" (see figure,
page A-1). Relative to the theoretical and practical advances in all
these areas, progress appears to have been much slower in
understanding and improving the selection step of indexing. One can
argue that today, at the present state of the art, selection is the
critical problem in indexing, both theoretically and practically. For
these reasons we wanted to develop an evaluation method that focussed
specifically on the selection step and could measure its
effectiveness independent of the translation step.

# Recall = (number of relevant documents retrieved in response to a
query) / (total number of documents in the system that are relevant
to the query)

Precision = (number of relevant documents retrieved in response to a
query) / (total number of documents retrieved)

Cleverdon originally called the latter "relevance ratio" but later
accepted the suggestion of others and changed it to "precision
ratio".
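
The two ratios defined in the footnote can be computed directly from
the set of documents retrieved and the set judged relevant. The
sketch below is illustrative; the function name, the handling of
empty sets, and the document numbers are assumptions made here.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all documents retrieved
    A ratio with an empty denominator is reported as 0.0 (an assumption).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 4 documents retrieved, 5 relevant in the collection,
# 2 of the retrieved documents among the relevant ones.
recall, precision = recall_and_precision(
    retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7]
)
print(recall, precision)
```

Here recall is 0.4 and precision is 0.5: the search found two of the
five relevant documents, and half of what it returned was relevant.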
Review of the criterion problem in indexing *



When we were writing up the first use of our method for evaluating
indexing, that is, our criterion measure, Cleverdon's "recall" and
"precision" ratios still had almost the status of an international
standard. At that time we were rather apologetic about introducing a
new criterion measure, particularly one that had not yet been validated
against the "ultimate" criterion concept, which in its fullest
explication runs something like this: performance of a real system, in
a real environment, supplying real documents, from a real collection,
in response to real queries, prompted by real problems of real users,
with performance rated objectively on the basis of how completely the
system retrieves every document in the collection that the user judges
as "relevant" to his query, and how completely it relieves the user of
the chore of weeding out documents he finds irrelevant. A number of
trends that began several years ago have recently accelerated, and the
criterion "problem" has changed markedly since our hesitant
introduction of a new measure. These trends can be summarized as
follows:

(1) There is growing recognition of the need for, and legitimacy
of, proximate criterion measures.

(2) In addition to good retrieval performance, as measured by
recall and precision, other system desiderata are receiving more
emphasis.

(3) The universal appropriateness and general utility of recall
and precision as measures are being questioned more frequently.

(4) The concept of "relevance", which is central to recall and
precision measures, is undergoing a rapid and drastic
metamorphosis. 13, 2, 11, 10

(5) On the most fundamental level, the implicit assumption behind
the old "ultimate" criterion concept is being challenged; this
assumption implies that exhaustive search is the function to be
served, whereas it is only one of the several functions of IR
systems (e.g., alerting, browsing, searching for "enough"
documents to make a decision, etc.), and not necessarily the most
important one.

* Snyder's distinctions among "criterion concept", "criterion measure",
and "criterion value" 35 are useful, and will be observed in the
following discussion.

Since the first two trends are especially pertinent to the measures
of preferredness to be used in the proposed study, they warrant some
discussion.

Although proximate criterion concepts have long been used for
day-to-day quality control (e.g., accuracy as judged by indexing
supervisors), and as a basis for management decisions (e.g., quality
of indexing as judged by experts in IR, or by experts in the
subject matter of the collection), they were considered a kind of
second-class measurement after recall and precision gained wide
acceptance in the IR community around 1963. Recently, however, based
on considerations of experimental design, Snyder 35 has offered
convincing arguments for the utility and legitimacy of proximate, or
intermediate, criterion concepts and measures. He points out the need
to study separately the different components in an IR system using
sensitive measures specific for the component being studied.
Apparently he is not ready to abandon the old ultimate criterion
concept, however, for he added the proviso that any proximate
criteria should be validated against retrieval performance,
presumably measured in terms of recall and precision. The same needs
were expressed in different words by a study conference sponsored by
NSF in February 1965, where the consensus was that

    "for the time being, in view of the present
    state of the art, efforts to develop and
    test evaluation methods and to conduct tests
    should be concentrated on selected features
    of document searching systems in systems
    context, rather than on total systems".

If one still holds to the old ultimate criterion concept, for
which the proper measures of system performance are expressed solely
in terms of two or more ratios based on the four-way partition
created by the two dichotomies, retrieved-not retrieved and
relevant-irrelevant (or an n-way partition, if relevance is rated on
some scale), all other measures can be only proximate. The validity
of such measures must, therefore, depend on their having some
dependable relation to the ultimate criterion measure. Ideally, this
relation is demonstrated empirically; but in practice it is often
assumed to exist, either because there is consensus that it "should"
exist or because it follows from an accepted theory. Sooner or later,
however, most proximate criterion measures tend to acquire "face"
validity and achieve an independent "status" that is accepted in all
except the most formal usage. Another way proximate measures become
independent of their original reference standard is by being
incorporated into a new theory, or by a redefining of concepts
central to the old ultimate criterion concept or measure. All of
these mechanisms are apparently at work in the IR field today, and a
host of criterion measures for indexing quality are acquiring status.



Most of these "new"* measures employ a group of individuals whose
collective responses establish a criterion standard against which any
"unknown" sample of indexing is measured. These measures fall into
the following four categories, based on the types of individuals that
compose the "criterion group". The following table classifies some
representative examples of such measures but does not include them all:



Composition of             Size of Group         Name of Criterion
Criterion Group                                  Measure

I.   Experts               as few as 1           "Accuracy" 5
     (authorities, etc.)

II.  Indexers              2 or more             "Consistency" [ref. illegible],
                                                 "precision of meaning" 39

III. Authors               no. represented in    "Relevance" 14
                           document corpus

IV.  Users                 as few as 1           "Relevance" 9,
     (simulated queries)                         "representativeness" 16, 17, 26



Category IV includes all measures in which the criterion group
members are not actual system users judging the relevance of system
responses to their own queries, which were generated in the course of
their regular work. Therefore, it includes measures in which the
criterion group consists of individuals who are entitled to, or might
be expected to, use the given service (potential users), e.g., the
querists in Cleverdon's last study, or individuals from a population
considered comparable to the system's clientele (simulated users). The
"face" validity of these measures depends upon one's opinion on how
closely they approach reality.

The methods used to calculate criterion values, and the size of
criterion groups, vary widely within a given category, as does the
probable reliability of the values obtained. In a few cases these
measures have been empirically validated against retrieval
performance in an operating system. As an example, in one system,
where an expert's judgement could be taken as final and "correct",
Bryant demonstrated that "accuracy" values correlated highly with
actual retrieval performance, and that consistency values correlated
highly with accuracy. 5

* Some are actually old; but the confidence with which they are used
seems new.
Another way to develop what might be considered a special type of
proximate criterion measure is to "adapt" the over-all system
criterion measure (i.e., the ultimate criterion measure) for assessing
separately the performance of some single system component, such as
indexing. In a special test system, such as Cleverdon's or the
Comparative Systems Laboratory of Western Reserve, this can be done by
actually performing all the operations concerned in information
retrieval, keeping everything constant except the component under
study.* However, this procedure is very difficult and expensive, and
one alternative is to "simulate" the condition of "all other things
being equal" by simply short-circuiting part of the chain. This is
what Cleverdon did in his last series of studies. His "searches" were
performed by paper-and-pencil simulation on document-term matrices;
this simulation, although it was carried out by people in this study,
was a clerical operation that he states could have been automated. An
especially interesting example of a short-circuit strategy is an
ingenious method Katter has developed to analyze and compare documents
or document representations, such as index terms, abstracts, etc.

The final step in streamlining the evaluation of indexing, of
course, is to replace the qualitative system model with a mathematical
formulation that permits quantitative predictions of retrieval
performance given numerical values for the variables. Then one can
simply "plug in" the criterion value for the component or process one
is studying and calculate the absolute value for retrieval performance
predicted by the mathematical model or theory. A number of such models
have been advanced as appropriate for at least part of a total system
(e.g., Salton's and Bryant's), and some of these have been tested with
varying degrees of rigor. This elegant way of evaluating indexing
seems very attractive, once the underlying theory has been well
tested; but for any of the current models, testing has thus far been
limited and/or confined to special cases where some of the variables
can be safely ignored.



* Snyder expresses the opinion that, even if one could control all
components other than the one being studied, and then see how changes
in this one component affect retrieval, the usual criterion measures
of over-all system performance would be too crude and insensitive to
answer some important questions about factors influencing individual
components. Whether this opinion is based on the results of the
early Cranfield studies, which seemed to show that the over-all system
criterion measure was remarkably insensitive, on the findings of
other studies, or on theoretical considerations is not clear.



The criterion-group method 




Our method tests indexing as a subsystem of the information
storage and retrieval process. It employs a group to set the
standard against which quality is tested, as opposed to a single
individual's judgment.

This criterion group can be made up of professional indexers,
of author indexers, or of document users. The choice is left to
the person using the method, according to his judgment of what
constitutes ideal indexing. In its original application, users
constituted the criterion group.
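
How a group, rather than a single individual, sets the standard can
be sketched as a tally of how many criterion-group members assigned
each term to a document. This is an illustration under stated
assumptions: the function name and the three members' term sets are
hypothetical, and terms are taken to be already comparable (edited)
before counting.

```python
from collections import Counter


def build_criterion_set(member_term_sets):
    """Count, for one document, how many group members used each term.

    Each member contributes at most once per term, so a count can be
    no larger than the size of the criterion group.
    """
    counts = Counter()
    for terms in member_term_sets:
        counts.update(set(terms))
    return counts


# Hypothetical indexing of one document by a three-member criterion group.
member_term_sets = [
    ["insulin", "glucose", "rat"],
    ["insulin", "glucose"],
    ["insulin", "liver"],
]

criterion_set = build_criterion_set(member_term_sets)
print(criterion_set.most_common())
```

The resulting counts are the frequencies by which matching terms in
a test set are weighted when the test set is scored.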
MATERIALS EMPLOYED IN STUDY TRIALS



Documents 

The documents employed in all trials came from the same corpus,
which consisted of 285 brief preliminary reports of research. The
source of these documents and the selection of the corpus are
described in the following paragraphs, which are adapted from the
original report on the criterion group method.

Each year several thousand 10-minute oral papers reporting
current biomedical research are given at the annual meeting of the
Federation of American Societies for Experimental Biology. This
national convention is the largest meeting for biomedical scientists,
and the work presented is an excellent cross-section of U.S.
biomedical research. The Federation consists of 6 societies, each
representing a major, basic biomedical discipline: biochemistry,
immunology, nutrition, pathology, pharmacology, and physiology. Only
members of these societies may present unsolicited papers at the
Federation Meeting, and the speaker must submit to his society a short
summary (225 words or less) of what he plans to say. These summaries
are published in a special issue of Federation Proceedings that
appears just before the annual convention. Although the document
submitted is called an "abstract", the term is a misnomer in that the
document is not usually produced by abstracting some preexisting
document.

Authors most commonly prepare the summary before they have written
the full text of their oral presentation. When published, such
anticipatory abstracts, therefore, represent primary documents:
condensed, preliminary reports that may or may not be followed at
some later time by the publication of a more detailed report. The
corpus for this study was selected by taking every 10th document
published in the 1962 meeting issue of Federation Proceedings,
Volume 21, No. 2, March-April (2,854 documents in all). In this
systematic sample, each of the 6 societies is represented by 9-11%
of all the documents submitted to it.
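
The selection of the corpus is an ordinary systematic sample, and
the arithmetic can be checked with a short sketch. The function name
is hypothetical, and starting the count at the 10th document (rather
than the 1st) is an assumption; it is chosen because it reproduces
the stated corpus size of 285 from 2,854 documents.

```python
def systematic_sample(items, step=10):
    """Take every step-th item, beginning with the step-th one."""
    return items[step - 1::step]


# Stand-in for the 2,854 documents of the meeting issue, numbered in order.
documents = list(range(1, 2855))
corpus = systematic_sample(documents, step=10)
print(len(corpus), corpus[:3], corpus[-1])
```

Starting instead from the first document would give 286 documents,
so the reported figure of 285 is consistent with counting from the
10th.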

Author-indexing form 

The authors of papers given at this meeting were required to
complete an "author-indexing form" like that illustrated in Figure
1. This is the form referred to as a vocabulary guide throughout
this report.
"AUTHOR INDEXING FORM"

(The form consisted of 2 pages; here the lower part of the first
page and the top of the second page have been omitted.)*

Please study the subject-category list before marking. The list will be used primarily for the arrangement of the
abstracts and for the production of the subject index to the abstracts. A secondary use will be for aid in programming.

Place the number "1" in the box at the left of the most specific category which classifies the area of your paper;
the number "2" in the box at the left of the next most specific category. Do not mark more than two categories.

In the blanks at the end of the subject-category list, please supply four or more additional descriptive terms (words
or short phrases) which can be used, besides the subject categories already selected, for further classifying and indexing
the content of your paper. The terms you supply should preferably be nouns. Generic names of chemical compounds
and drugs should be used, rather than trade names or jargon.

SUBJECT CATEGORIES

[Checklist not reproduced: numbered check boxes beside a hierarchically
arranged list of subject categories and subcategories, for example
Amino Acids; Antigen-Antibody Reactions; Antigens, Antibodies;
Carbohydrates; Cardiovascular System; Cell Structure, Function;
Cell, Tissue Culture; Chemotherapy; Connective Tissue; Drug
Metabolism; Endocrines; Energy Metabolism; Environment; Vitamins.]

ADDITIONAL DESCRIPTIVE TERMS
* This figure is reproduced from: Schultz, Claire K., Wallace L. Schultz,
and Richard H. Orr. Comparative indexing: terms supplied by biomedical authors
and document titles. American Documentation 16, 4 (Oct. 1965), p. 299-312.
Thesaurus

The thesaurus employed when criterion and test sets were
manually standardized has been published as the "Guide to Current
Terminology in Biomedical Research",* which represents an indexing
vocabulary of the Federation of American Societies for Experimental

Biology. The Guide lists 1,51.^ different terms consisting of one 

or more words, and specifies their closest equivalent in the indexing
"languages" used by the National Library of Medicine, Defense
Documentation Center, and the Division of Research Grants of the
National Institutes of Health. There is a high degree of
"compatibility" among the indexing vocabularies of FASEB, NLM, DDC,
and NIH; three-quarters of all the FASEB terms are readily
translatable into both NLM and NIH languages.



* Schultz, Claire K., Compiler and Editor. Guide to current
terminology in biomedical research. Federation Proceedings,
Vol. 24, No. 4, July-August, 1965.
APPENDIX C



SUBJECTS PARTICIPATING IN STUDY TRIALS 



Criterion group 

The criterion sets employed in this study were established
by a group of potential users of the indexing for the document
corpus, in this case members of the professional association
from which the documents in this corpus were obtained, the
Federation of American Societies for Experimental Biology, as
described in Appendix A. The criterion group was a sample of
the membership selected to represent the document authors'
peers. Two active research workers from each of the six
disciplines in the Federation were selected by Dr. Milton Lee,
Executive Officer of the Federation, on the basis of their
recognized standing in the research community and on the
likelihood that they would be willing to participate in the
study. To facilitate holding a meeting at Federation
Headquarters in Bethesda at which the study could be explained
and uniform instructions could be given, the original selection
was limited to scientists in the Bethesda area (National
Institutes of Health and Naval Medical Research Institute).
One of the 12 scientists originally selected had to withdraw;
he was replaced by a research worker in a pharmaceutical
company, who was also well known in his discipline.

Author-indexers 



As described in Appendix A, each of the authors of the 
documents in the corpus had supplied indexing terms when 
he submitted his paper. The author sets consisted of these 
indexing terms. The titles these authors had given the 
documents supplied the title sets. 

Professional indexers 



This group consisted of eight professional indexers, all
of whom were experienced in working with biomedical documents.
With one exception, they were senior personnel from indexing
services or from information service departments of
pharmaceutical companies. The exception was an indexer who was
currently working directly with biomedical scientists in a
university setting. This group represents a sample of the
universe of such indexers selected largely on the basis of
friendship with the present investigators.



Non-professional indexers

The non-professional indexers employed in this study
consisted of two groups of second-year medical students
from the same school, none of whom had any experience in
indexing. Ten students comprised Group A; there were nine
students in Group B.



APPENDIX D

Procedures for Manual Implementation* 



Editing and recording term uses 

A term-use matrix, similar to that shown in Table I, was created for each
document; however, only two test sets, the author and title sets, are shown
here. In the present study, term usage by the professional indexer group and
the non-professional indexer group was similarly recorded. Each term use by
each of the 12 members of the criterion group and by the author was indicated
by an X; the "presence" of a term in the title was similarly recorded. The
two members of each of the six disciplinary pairs making up the criterion
group were designated A and B.

Weighting 

The criterion set of terms for a document consists of all the different
terms used by members of the criterion group to describe that document; thus
285 criterion sets were established, one for each document. For the document
illustrated in Table I, the criterion set contained 13 terms. Each term in a
criterion set was assigned a weighting factor by one of two schemes: in
Scheme #1 the weight was equal to the number of criterion group members who
had used it to describe the given document, whereas in Scheme #2 this number
was squared. This weighting procedure is illustrated in Table I. Note that
the weighting of a term was not affected by whether the author had or had not
included it among the indicia he supplied, or by whether it was supplied by the
document title. Terms in test sets that had not been used by at least one
member of the criterion group we will refer to as "zero terms", since they were
given a weight of 0.
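The two weighting schemes can be sketched in a few lines of Python. This is an illustration only, not the FORTRAN IV or PL/I programs used in the study; the function names, the member labels ("1A", "1B", "2A") and the sample terms are assumptions made for the example.

```python
from collections import Counter


def criterion_weights(criterion_usage, scheme=1):
    """Return {term: weight} for one document.

    criterion_usage maps each criterion-group member to the terms that
    member used for the document. Scheme 1 weights a term by the number
    of members who used it; Scheme 2 squares that number.
    """
    if scheme not in (1, 2):
        raise ValueError("scheme must be 1 or 2")
    # Count each member at most once per term.
    counts = Counter(
        term for terms in criterion_usage.values() for term in set(terms)
    )
    return {term: n ** scheme for term, n in counts.items()}


def term_weight(term, weights):
    """Weight of a test-set term; "zero terms" get a weight of 0."""
    return weights.get(term, 0)


# Hypothetical usage by three criterion-group members for one document.
usage = {
    "1A": {"glucose", "insulin"},
    "1B": {"glucose"},
    "2A": {"glucose", "liver"},
}
w1 = criterion_weights(usage, scheme=1)
w2 = criterion_weights(usage, scheme=2)
```

Here "glucose" receives a weight of 3 under Scheme #1 and 9 under Scheme #2, and any term absent from the criterion set is looked up as a zero term.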

Scoring 

The raw score for each test set was calculated by adding the weights for
all terms in the set. For example, for the author set shown in Table I, the
raw score is 20 when the terms are weighted by Scheme #1, and 84 by Scheme #2.
Since the number of terms, and the weighting of these terms, varies from one
criterion set to another, the constraints on the raw scores also vary. To
facilitate comparisons, we converted the raw scores into percentages of the
highest score that could be awarded ("maximal score"), i.e., the sum of the
weights for all terms in the criterion set. For the document illustrated in
Table I, if the author set had contained all of the 13 terms in the criterion
set, its raw score would have equaled the maximal score for this document
(28 by Scheme #1, 96 by Scheme #2); the actual raw score was 71% of the
maximal score when Scheme #1 was employed, and 88% with Scheme #2.
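The arithmetic of the scoring step can be checked with a short Python sketch. Table I is not reproduced here, so the individual consensus numbers below are hypothetical; they were chosen only so that the totals agree with the figures quoted in the text (13 criterion terms, maximal scores of 28 and 96, author raw scores of 20 and 84). The term labels and function names are likewise assumptions.

```python
def raw_score(test_set, weights):
    """Sum the weights of the test-set terms; zero terms add nothing."""
    return sum(weights.get(term, 0) for term in set(test_set))


def maximal_score(weights):
    """Highest score that could be awarded: all criterion terms matched."""
    return sum(weights.values())


def percent_of_maximal(test_set, weights):
    return 100.0 * raw_score(test_set, weights) / maximal_score(weights)


# Hypothetical consensus numbers for 13 criterion terms t01..t13.
labels = [f"t{i:02d}" for i in range(1, 14)]
consensus = dict(zip(labels, [7, 4, 3, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1]))

scheme1 = dict(consensus)                            # weight = consensus number
scheme2 = {t: n * n for t, n in consensus.items()}   # weight = number squared

# Hypothetical author set: the first seven criterion terms plus one zero term.
author_set = set(labels[:7]) | {"zero term"}

pct1 = percent_of_maximal(author_set, scheme1)   # 20 / 28
pct2 = percent_of_maximal(author_set, scheme2)   # 84 / 96
```

With these assumed weights the author set earns 20 of 28 points (about 71%) under Scheme #1 and 84 of 96 points (87.5%, reported as 88%) under Scheme #2.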



* This material is adapted from the published description of the first 
application of the method [American Documentation 16, 4, (October, 1965)]. 
APPENDIX E

COMPUTER IMPLEMENTATION 

After test indicia have been obtained it is possible to accomplish
any or all of the additional processing steps by means of computer
programs, such as those constructed for carrying out this study.*

For machine processing the first requirement is that the data be made
machine-readable, that is, keypunched. The input format used for this
study consisted of a document identification, an indexer identification,
and one indexing "term" (an alphabetic or numeric expression) of
any length, up to the capacity of a punched card. The specific
keypunching instructions used are given in Table E-I.

If machine-editing is to be done, or even if it is not, the next
processing step is to "tag" the keypunched indicia so that every
"word" can be identified with its document, indexer, and position
within the indexer's total response to the document. The "tagged"
units (words) are sorted so they will be properly organized for
matching either the thesaurus, if machine standardizing is done, or,
if editing is to be omitted, the criterion set.#
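A minimal sketch of the tagging and sorting step follows. It assumes an alphabetic sort on the word (the order implied by the standardization example later in this appendix) and a 1-based position counter; the `TaggedWord` record and the sample identifiers are hypothetical and are not taken from the study's programs.

```python
from typing import NamedTuple


class TaggedWord(NamedTuple):
    word: str
    document: str
    indexer: str
    position: int  # 1-based position within the indexer's total response


def tag_indicia(records):
    """Tag every word of every term with document, indexer, and position.

    records is an iterable of (document, indexer, term) in keypunched order.
    """
    next_position = {}
    tagged = []
    for document, indexer, term in records:
        for word in term.split():
            key = (document, indexer)
            next_position[key] = next_position.get(key, 0) + 1
            tagged.append(
                TaggedWord(word.upper(), document, indexer, next_position[key])
            )
    return tagged


records = [("001", "07", "amino acids in ruminant nutrition")]
# Tuples sort field by field, so the word is the primary sort key.
tagged = sorted(tag_indicia(records))
words_in_order = [t.word for t in tagged]
```

After sorting, the words arrive in alphabetic order while each one still carries the tag needed to recover where it stood in the indexer's response.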




Standardizing indicia can accomplish any of the following:
(1) eliminate what are considered "nonsubstantive" words, such as
connectives, or (2) change word variants, such as singular and plural
forms, into a "standard" form, or (3) convert what are considered
synonymous expressions into a single "standard" expression, or (4)
add additional, possibly more generic, words to indicia, such as
"carbohydrates" in response to the term "glucose". To perform such
transformations on the raw indicia there must first be a set of "rewrite"
rules for all anticipated entry terms,$ and also a program which matches




* Programs written in FORTRAN IV and PL/I for use on the IBM 360/67
computer. The investigators can make these programs available to
interested persons.

# The criterion data will have been given the same treatment as the
indicia, prior to any matching of the criterion and indexing sets.

$ Including the trivial rewrite rule that retains some terms in the same
form as when encountered in the input. Words not of interest can be
omitted from the thesaurus and automatically deleted for reason of
non-match, but since this practice deletes "new" words of interest, it is
a better practice to actively delete unwanted terms and "save" nonmatching
words on a separate list that is put out for human reaction. With this
approach it is also possible to decline to make thesaural rewrite rules for
certain semantically ambiguous terms and wait until their full contexts
are known (at processing time) to instruct the computer how to deal with
specific occurrences.
entry terms with the thesaurus, and then carries out the thesaurus
rewrite rules as they are encountered. The thesaurus used in this
study examines one input word at a time, but it also examines
contexts, so that expressions such as "amino acids" or "citric acid
cycle" can be retained as units. More information about the recently
developed context-dependent standardizing technique used in this study
is given in a separate paper.*
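The four kinds of rewrite rule, together with the practice of saving nonmatching words for human review, might be sketched as follows. The rule table is hypothetical and covers single words only; the context-dependent technique used in the study is not reproduced here. Leaving nonmatching words out of the standardized output until a person has reviewed them is an assumption of this sketch.

```python
DELETE = object()  # marker for words that are actively deleted

# Hypothetical rewrite rules: entry word -> DELETE or list of output words.
REWRITE_RULES = {
    "IN": DELETE,                             # (1) nonsubstantive words
    "OF": DELETE,
    "AND": DELETE,
    "ENZYMES": ["ENZYME"],                    # (2) word variants
    "ENZYME": ["ENZYME"],                     #     trivial rule: keep as is
    "CARDIAC": ["HEART"],                     # (3) synonymous expressions
    "HEART": ["HEART"],
    "GLUCOSE": ["GLUCOSE", "CARBOHYDRATES"],  # (4) add a more generic word
}


def standardize(words, rules=REWRITE_RULES):
    """Apply rewrite rules; return (standardized words, nonmatching words)."""
    standardized, unmatched = [], []
    for word in words:
        key = word.upper()
        rule = rules.get(key)
        if rule is None:
            unmatched.append(key)      # saved on a separate list for review
        elif rule is not DELETE:
            standardized.extend(rule)
    return standardized, unmatched


result, review_list = standardize(["cardiac", "enzymes", "in", "glucose", "xylitol"])
```

In this example "xylitol" has no rule, so it lands on the review list rather than being silently dropped.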

The following example illustrates standardization. The phrase
"amino acids in ruminant nutrition" would be examined one word at a
time, as each was encountered in its alphabetic order when processing
a list of all individual words making up the indicia of a test corpus.
"Acids", encountered first, would be held for potential "partner
words" to be encountered in the later portion of the alphabetic list.
"Amino", encountered next, would, by means of its sequential "tag", be
identified as a "partner" of "acids", and the rewrite rule would cause
"amino acids" to become the standardized index term. "In" would be
deleted from the list as soon as encountered; "nutrition" would be
held for potential match with "partner words". When "ruminant" was
processed it would be identified as a partner word of "nutrition",
and the rewrite rule would transform "ruminant nutrition" into the
standardized term "animal nutrition".
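The partner-word example can be traced with the following Python sketch. It is a simplified reading of the description above, not the technique of the separate paper: partners are assumed to be words in adjacent positions, and the two pair rules and the deletion list are hypothetical.

```python
# Hypothetical two-word rules: (first word, second word) -> standardized term.
PAIR_RULES = {
    ("AMINO", "ACIDS"): "AMINO ACIDS",
    ("RUMINANT", "NUTRITION"): "ANIMAL NUTRITION",
}
DELETED = {"IN"}


def standardize_phrase(phrase):
    """Process the words of one phrase in alphabetic order, pairing partners."""
    # Tag each word with its position, then sort alphabetically.
    tagged = sorted((word, pos) for pos, word in enumerate(phrase.upper().split()))
    held = {}    # position -> word held for a potential partner
    terms = []
    for word, pos in tagged:
        if word in DELETED:
            continue                     # deleted as soon as encountered
        # A partner is a held word immediately before or after this one.
        for other_pos in (pos - 1, pos + 1):
            other = held.get(other_pos)
            if other is None:
                continue
            pair = (other, word) if other_pos < pos else (word, other)
            term = PAIR_RULES.get(pair)
            if term is not None:
                del held[other_pos]
                terms.append(term)
                break
        else:
            held[pos] = word             # no partner yet: hold the word
    # Words still held found no partner; keep them as single-word terms.
    terms.extend(held[p] for p in sorted(held))
    return terms


terms = standardize_phrase("amino acids in ruminant nutrition")
```

"ACIDS" is held until "AMINO" arrives, and "NUTRITION" is held until "RUMINANT" arrives, which mirrors the order of events in the worked example.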

If standardization is not done, non-substantive words such as
"and", word variants such as "enzyme", "enzymes", "enzymal",
"enzymatic", and synonyms such as "heart" and "cardiac" are all
candidates for match. Every such word is treated as unrelated to the
others when the criterion and indexing sets are matched for scoring.
As a result, if the criterion set contains, for example, "cardiac
arrest" and the indexing set contains "heart arrest", the indexing
set will not gain any score, because of mismatch; but the same
indexing set could gain score because a trivial word such as "of"
did match in the two sets.
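The effect of omitting standardization can be seen in a small Python sketch that matches raw words by exact spelling. The weights and word lists are hypothetical.

```python
def exact_match_score(criterion_weights, indexing_words):
    """Sum the weights of criterion words that literally appear in the indexing set."""
    present = set(indexing_words)
    return sum(w for word, w in criterion_weights.items() if word in present)


# Hypothetical unstandardized criterion words and their weights.
criterion = {"cardiac": 4, "enzymes": 3, "of": 2}

# The indexer used a synonym and a singular form; only "of" matches literally.
indexing = ["heart", "enzyme", "of"]
score = exact_match_score(criterion, indexing)
```

The indexing set earns credit only for the trivial word "of", while its substantively correct words "heart" and "enzyme" score nothing.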

The purpose of the scoring program is to perform the matching
operation, for one document at a time, between the indexing set(s)
and the criterion set. The program can have various options, as
is true for the scoring program used in this study, which instruct
the program to calculate scores for a single test set or group of
sets, for single documents or for groups of documents. Another
kind of option instructs the program to calculate standard
deviations, make t tests, or perform other statistical computations
as the required data become available during processing. The
program can be used to print out detail about the matching processes



* Gopnik, Myrna and Claire K. Schultz. Methods for thesaurus
processing of context-dependent segments in language. Submitted
for publication.

and intermediate calculations it performs, or only specified
results, such as percent of maximal score or points per term.

If all of the computer programs just described are considered
as a system, with the standardization procedure optional, it can
be seen that keypunched raw indicia can be fed into the computer,
and the scored results obtained, without any manual processing
required.
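A scoring pass over a group of documents, with a summary of the kind the options above describe, could look like the following Python sketch. It is not the study's scoring program. "Points per term" is assumed here to mean the raw score divided by the number of terms in the test set, and the document numbers and weights are hypothetical.

```python
from statistics import mean, stdev


def score_document(weights, test_terms):
    """Return (percent of maximal score, points per term) for one document."""
    terms = set(test_terms)
    raw = sum(weights.get(term, 0) for term in terms)
    percent = 100.0 * raw / sum(weights.values())
    points_per_term = raw / len(terms) if terms else 0.0
    return percent, points_per_term


def summarize(criterion_by_doc, test_by_doc):
    """Score one test set on every document and summarize the percentages."""
    percents = [
        score_document(criterion_by_doc[doc], test_by_doc[doc])[0]
        for doc in sorted(test_by_doc)
    ]
    return {
        "n": len(percents),
        "mean": mean(percents),
        "stdev": stdev(percents) if len(percents) > 1 else 0.0,
    }


# Hypothetical criterion weights and test sets for two documents.
criterion_by_doc = {
    "001": {"a": 3, "b": 1},
    "002": {"c": 2, "d": 2},
}
test_by_doc = {"001": ["a"], "002": ["c", "d", "x"]}
summary = summarize(criterion_by_doc, test_by_doc)
```

The per-document percentages (75 and 100 in this example) are the values on which standard deviations or t tests would then be computed.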
TABLE E-I

KEYPUNCHING INSTRUCTIONS FOR INDEXING DATA

Write the document number in the first three columns. It is always
a three digit number (it will range from 001 to 285) and is found in the
top right hand corner of the indexing sheet. Do not skip a space. In
the next two columns write the number code of the particular indexer.
This will always be a two digit number (it will range from 00 to 99),
and is found in the top left hand corner of the indexing sheet. Do not
skip a space. In the next column write the number of the discipline of
the author of the document. The discipline of the author is given on
the extreme right of the first line of the document in the FASEB book.
Use the following code to number the discipline:

    1. Physiology      4. Immunology
    2. Biochemistry    5. Nutrition
    3. Pharmacology    6. Pathology

Do not skip a space. In the next columns enter one indexing term used
by that indexer for that document. This will be either a three digit
number which has been checked by the indexer on the sheet or a term
written in by the indexer on the space provided at the end of the sheet.
If the indexing term is a number the finished card will contain nine
digits with no spaces between them. If the indexing term is written,
then you may use as much of the card as necessary to record it. If the
term is from those written in and you have trouble reading the
handwriting or are not sure of the spelling, read the document and most
often the term in question will appear there. If you cannot find the
term in the document, and cannot decipher the handwriting, keypunch your
best guess and set the card aside to be checked. In some cases a
written-in term will consist of more than one word. If this is the
case, leave one space between words. Do not include any punctuation,
e.g., commas, parentheses. Do include hyphens. If the term is a
chemical formula with subscripts, then copy it as if it were all on one
line, e.g., CO₂ becomes CO2. Repeat the six digit
document-indexer-author number at the beginning of the next card and,
following the above format, record the next index term. Greek letters
such as alpha can be punched as an A followed by a hyphen. Beta can be
punched as a B followed by a hyphen. Delta can be punched as a D
followed by a hyphen. Gamma is written out (GAMMA) followed by a hyphen.

If a term will exceed column [illegible], look for a connective earlier
in the term where it could be broken into two terms.

Ex: Amino acid metabolism nutrition/of immature
    embryonic chicks fed methionine
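The card layout described in Table E-I (three columns for the document, two for the indexer, one for the author's discipline, and the remainder for the term) can be read back with a short Python sketch. The function name, the dictionary keys, and the sample cards are hypothetical, and the sketch assumes that discipline code 6 stands for Pathology.

```python
DISCIPLINES = {
    "1": "Physiology",
    "2": "Biochemistry",
    "3": "Pharmacology",
    "4": "Immunology",
    "5": "Nutrition",
    "6": "Pathology",   # assumed code; see the note above
}


def parse_card(card):
    """Split one keypunched card image into its fields."""
    card = card.rstrip()
    if len(card) < 7 or not card[:6].isdigit():
        raise ValueError(f"malformed card: {card!r}")
    discipline = DISCIPLINES.get(card[5])
    if discipline is None:
        raise ValueError(f"unknown discipline code: {card[5]!r}")
    return {
        "document": card[0:3],     # columns 1-3
        "indexer": card[3:5],      # columns 4-5
        "discipline": discipline,  # column 6
        "term": card[6:],          # column 7 onward: code number or written term
    }


numeric_card = parse_card("042073123")              # nine digits, no spaces
written_card = parse_card("285992AMINO ACIDS     ")
```

A numeric term yields a nine-digit card, as the instructions state, while a written-in term simply occupies as much of the card as it needs.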
ABSTRACT

This method tests the effectiveness of test indexing sets, using a criterion
group to set the standard for "ideal" indexing. The criterion group for a
particular application is chosen by the test administrator, consistent with his
own concept of who represents this "ideal". Matching test sets of indexing terms
with the criterion set yields as many degrees of match as there are members of
the criterion group (referred to as consensus number). Important variables for
the method are: size of document sample, size of criterion group, indexers'
instructions, method of editing raw indicia to make them comparable, and method
of weighting term sets for scoring.

Results of testing the methodologic variables for their effects on reliability,
sensitivity, flexibility and practicality of the method show that "indicative"
tests can be made at the 80% level of confidence with document samples as small
as 10 and criterion groups as small as 4; 95% confidence required comparable
values as large as 20 documents and 9 criterion group members. The 3 editing
methods tested ("none", manual, and computer) yielded different scores, but each
preserved differences between test sets, so the choice of editing method was not
important to sensitivity or reliability, only practicality. Scores changed with
differences in indexer instructions or addition of a vocabulary guide, so the
method is sensitive to such differences between tests but can be carried out
successfully with essentially no indexer instruction and no guide. Consensus
number was almost as useful as consensus number squared for detecting
differences in test sets; but the latter weighting emphasized "recall" value to
some extent.